Parle: parallelizing stochastic gradient descent
Authors
Abstract
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4× faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.
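The abstract does not spell out Parle's update equations, so the following is only a rough single-process sketch of the communication pattern it describes: several replicas take many local SGD steps, coupled to a shared server copy by a proximal pull, and the server is refreshed by averaging only occasionally. The toy quadratic loss, the coupling strength `rho`, and all other constants below are illustrative assumptions, not the paper's actual algorithm.

```python
# Rough sketch only: replicas do many local SGD steps with a proximal pull toward a
# shared server copy; the server averages the replicas only every `sync_every` steps,
# so communication is infrequent. The quadratic loss and all constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares objective standing in for a deep network's training loss.
A = rng.normal(size=(64, 10))
b = rng.normal(size=64)

def stochastic_grad(x, idx):
    """Gradient of 0.5 * ||A x - b||^2 restricted to the mini-batch `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi)

n_replicas, dim = 4, 10
lr, rho = 0.02, 0.1            # rho: strength of the pull toward the server copy
sync_every, n_steps = 25, 500  # synchronize only every 25 local steps

server = np.zeros(dim)
replicas = [server.copy() for _ in range(n_replicas)]

for step in range(1, n_steps + 1):
    for a in range(n_replicas):
        idx = rng.choice(len(b), size=8, replace=False)          # local mini-batch
        g = stochastic_grad(replicas[a], idx) + rho * (replicas[a] - server)
        replicas[a] -= lr * g                                    # local computation
    if step % sync_every == 0:
        server = np.mean(replicas, axis=0)                       # rare communication

print("loss at server:", 0.5 * np.linalg.norm(A @ server - b) ** 2)
```

Increasing `sync_every` trades communication for extra local computation, which is the trade-off the abstract emphasizes for multi-GPU and distributed settings.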
Similar resources
Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms
Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without ...
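As an illustration of the general idea only, and explicitly not Splash's real programming interface (which this excerpt does not show), the sketch below has a user-written sequential per-example update and a toy "execution engine" that runs it on data shards and merges the results by parameter averaging; the names `sequential_sgd_update` and `run_on_shards` are hypothetical.

```python
# Illustrative only: NOT Splash's actual API. The user supplies a sequential
# per-example update; a toy engine runs it on data shards and averages the results.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=400)

def sequential_sgd_update(w, x_i, y_i, lr=0.01):
    """User-written sequential rule: one least-squares SGD step on one example."""
    return w - lr * (x_i @ w - y_i) * x_i

def run_on_shards(update, X, y, n_shards=4):
    """Toy 'execution engine': apply the sequential rule per shard, then average."""
    shards = np.array_split(np.arange(len(y)), n_shards)
    results = []
    for idx in shards:                      # a real engine would run shards in parallel
        w = np.zeros(X.shape[1])
        for i in idx:
            w = update(w, X[i], y[i])
        results.append(w)
    return np.mean(results, axis=0)         # simple parameter averaging to merge

print("merged weights:", run_on_shards(sequential_sgd_update, X, y))
```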
SpeeDO: Parallelizing Stochastic Gradient Descent for Deep Convolutional Neural Network
Convolutional Neural Networks (CNNs) have achieved breakthrough results on many machine learning tasks. However, training CNNs is computationally intensive. When the size of training data is large and the depth of CNNs is high, as typically required for attaining high classification accuracy, training a model can take days and even weeks. In this work, we propose SpeeDO (for Open DEEP learning ...
Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging
This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) mini-batching, a method of averaging many samples of the gradient both to reduce the variance of a stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method involving averaging the final few i...
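On a toy least-squares problem (chosen here only for illustration, with an assumed step size and batch size), the two techniques the abstract names look roughly as follows: mini-batching averages the per-example gradients inside each step, and tail-averaging reports the mean of the final iterates instead of the last iterate.

```python
# Sketch of (1) mini-batching and (2) tail-averaging on an assumed least-squares problem.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)

w = np.zeros(8)
lr, batch_size, n_steps = 0.05, 32, 400
iterates = []

for _ in range(n_steps):
    idx = rng.choice(len(y), size=batch_size, replace=False)
    # (1) mini-batching: average the per-example gradients in the batch
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    w = w - lr * g
    iterates.append(w.copy())

# (2) tail-averaging: average the final half of the iterates
w_tail = np.mean(iterates[n_steps // 2:], axis=0)
print("last-iterate loss:", np.mean((X @ w - y) ** 2))
print("tail-averaged loss:", np.mean((X @ w_tail - y) ** 2))
```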
Multi-GPU Training of ConvNets
In this work, we consider a standard architecture [1] trained on the Imagenet dataset [2] for classification and investigate methods to speed convergence by parallelizing training across multiple GPUs. In this work, we used up to 4 NVIDIA TITAN GPUs with 6GB of RAM. While our experiments are performed on a single server, our GPUs have disjoint memory spaces, and just as in the distributed setti...
Parallelizing Big Data Machine Learning Algorithms with Model Rotation
This paper investigates a novel approach to parallelization of machine learning algorithms using model rotation as an effective parallel computation model. We identify the importance of model rotation owing to its ability to shift the latest model updates to a neighboring computation, thereby guaranteeing model consistency which is hard to achieve in other computation models. We distinguish com...
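As a rough single-process sketch of the rotation pattern described above (an assumption about the general idea, not the paper's actual system): the model is split into blocks, each worker updates only the block it currently holds using its own local data, and the blocks rotate around the ring so every worker eventually applies the latest updates to every block.

```python
# Single-process simulation of model rotation; the data, step size, and problem are
# illustrative assumptions. Blocks of the model rotate around a ring of workers, and
# each worker updates only the block it currently holds, using the latest full model.
import numpy as np

rng = np.random.default_rng(3)
n_workers, block_dim = 4, 3
dim = n_workers * block_dim
w_true = rng.normal(size=dim)

# Each worker's local least-squares data (X_w, y_w).
local = []
for _ in range(n_workers):
    Xw = rng.normal(size=(80, dim))
    local.append((Xw, Xw @ w_true + 0.05 * rng.normal(size=80)))

model = np.zeros(dim)
blocks = np.array_split(np.arange(dim), n_workers)  # block -> coordinate indices
holder = list(range(n_workers))                     # holder[k]: worker holding block k

for _ in range(10 * n_workers):                     # several full rotations
    for k, worker in enumerate(holder):
        Xw, yw = local[worker]
        idx = blocks[k]
        residual = Xw @ model - yw                  # uses the latest full model
        g = Xw[:, idx].T @ residual / len(yw)       # gradient w.r.t. this block only
        model[idx] -= 0.1 * g
    holder = holder[1:] + holder[:1]                # rotate blocks to the next worker

print("distance to w_true:", np.linalg.norm(model - w_true))
```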
Journal: CoRR
Volume: abs/1707.00424
Pages: -
Year of publication: 2017